Text categorization on Reuters corpus

Author

Ivana Lukšová

Abstract

1 Task

The task of text categorization can be described as follows: given a set of documents, we want to assign to each document one or more text categories, or no category. In this term project, we want to categorize documents from the well-known Reuters-21578 corpus, a collection of 21,578 articles published by Reuters in 1987. We have chosen only the three most frequent text categories as the target categories:

- Mergers/Acquisitions (ACQ)
- Earnings and Earnings Forecasts (EARN)
- Money/Foreign Exchange (MONEY-FX)

Since these categories may overlap, we have a total of 8 target labels, one for each subset of these categories. We want to train a classifier that assigns a target label to a given document.

We want to compare two different approaches to this task. Both are based on building an ensemble of binary classifiers. In the first approach, we will use a simple majority voting method to assign a target label. In the second approach, we will use the outputs of this ensemble as the input for another classifier.

Our ensemble will consist of 9 classifiers: for each text category (ACQ, EARN, MONEY-FX) we will train 3 binary classifiers with different machine learning methods:

- Support Vector Machine (SVM)
- Random Forest (RF)
- Naïve Bayes (NB)

The output of each of these binary classifiers will be TRUE or FALSE, that is, assigning or not assigning a given category. In the majority voting approach, if at least two of the three classifiers for a category output TRUE, we will assign that text category to the input document. The target label will be determined as the combination of the assigned text categories. In the second approach, we will train a decision tree that uses these nine binary outputs as its input and directly assigns the target label.

In the text categorization task, the first step is to transform text documents into a set of features suitable for classifier learning methods. A convenient transformation is to represent the document as a set of words, ignoring their order in the text. Thus we will split the text documents into words, and each feature will then correspond to the occurrence of a particular term in the text.
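The majority voting step and the word-based document representation described above can be illustrated with a minimal Python sketch. It is not the author's implementation: the function names, the "+"-joined label format, the "NONE" label for the empty subset, and the hard-coded classifier outputs are all assumptions made for illustration, and the trained SVM, RF, and NB models are replaced by fixed TRUE/FALSE values.

```python
from collections import Counter

CATEGORIES = ("ACQ", "EARN", "MONEY-FX")


def bag_of_words(text):
    """Map a document to term counts, ignoring word order."""
    return Counter(text.lower().split())


def majority_vote(votes):
    """Return True when at least two of the three binary outputs are TRUE."""
    return sum(votes) >= 2


def target_label(votes_by_category):
    """Combine the categories that win their vote into one target label.

    The "+"-joined naming and the "NONE" label are illustrative assumptions.
    """
    assigned = [c for c in CATEGORIES if majority_vote(votes_by_category[c])]
    return "+".join(assigned) if assigned else "NONE"


# Hypothetical (SVM, RF, NB) outputs for a single document.
votes = {
    "ACQ": (True, True, False),
    "EARN": (False, False, True),
    "MONEY-FX": (True, True, True),
}
label = target_label(votes)

# Hypothetical document text, used only to show the feature extraction.
features = bag_of_words("Shares rose as the merger closed and shares traded")
```

With three categories, `target_label` can return 2^3 = 8 distinct labels, matching the 8 target labels mentioned in the abstract. In the second approach, the individual binary outputs would instead be passed as input features to a decision tree.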

Similar resources

Experiments with multi-label text classifier on the Reuters collection

Text categorization is the task of assigning a text document to an appropriate category from a predefined set of categories. We present an approach to hierarchical text categorization, a recently emerged subfield of the main topic. Here, documents are assigned to leaf-level categories of a category tree (called taxonomy). The algorithm applies an iterative learning module that allow...

Categorizing Gigabytes: Experiments on the RCV1 Corpus

This paper presents categorization results obtained with the HITEC categorizer tool on the new benchmark document collection for text categorization, the Reuters Corpus Volume 1 (RCV1). RCV1 is an archive of over 800,000 manually categorized newswire stories made available by Reuters in 2000 for research purposes. This collection was released to take the place of the Reuters-21578 collection tha...

A Text Categorization Based On A Summarization Extraction

We propose a new approach to text categorization based upon the ideas of summarization. It combines word-based frequency and position method to get categorization knowledge from the title field only. Experimental results indicate that summarization-based categorization can achieve acceptable performance on Reuters news corpus.

Hierarchical vs. flat n-gram-based text categorization: Can we do better?

Hierarchical text categorization (HTC) refers to assigning a text document to one or more most suitable categories from a hierarchical category space. In this paper we present two HTC techniques based on kNN and SVM machine learning techniques for categorization process and byte n-gram based document representation. They are fully language independent and do not require any text preprocessing s...

An Examination of Feature Selection Frameworks in Text Categorization

Feature selection, an important task in text categorization, is used for dimensionality reduction. Feature selection can be performed either locally or globally. For local selection, distinct feature sets are derived from different classes. The number of feature sets thus depends on the number of classes. In contrast, only one universal feature set will be used in global fea...

Publication date: 2014
